Finding Groups in Chronologically-Ordered Corpus Data: Variance-Based Neighbor Clustering
Abstract
Much corpus-linguistic research is concerned with the development of particular parameters over time. For example, in L1/L2 acquisition/learning, the syntactic development of a child/learner is approximated on the basis of how mean lengths of utterances (MLU), t-unit-based measures, or IPSyn values change over time (cf. Shirai and Andersen 1995 or Ortega 2003). Similarly, in historical linguistics, an expression's degree of grammaticalization is often approximated via how the percentages of the expression's use as a lexical or grammatical element change over time; cf. Svensson (2000). Most such studies aim at representing this variation in terms of stages. Probably the best-known example is that of Brown's (1973) MLU stages, which underlie much work on L1 acquisition, but cf., say, Hilpert (2006) for a diachronic example. However, so far no broadly applicable yet principled method for identifying such stages has been developed. In language acquisition, Brown's cut-off points are essentially arbitrary; elsewhere, the data are just split up into n equally-sized parts, where "equally-sized" variously refers to amounts of time or numbers of items. I will introduce a completely data-driven method, variance-based neighbor clustering (VNC), that takes as input chronologically-ordered corpus data and solves this problem. It is similar to standard clustering approaches in that clustering is performed objectively using quantitative information and is represented graphically in the form of dendrograms. VNC differs from standard approaches in that it only clusters neighboring data points, thus preserving the data points' temporal sequence. I will discuss the advantages of this method on the basis of results from two submitted case studies, one based on tense-aspect acquisition in the Stoll corpus of Russian L1 acquisition, the other based on data regarding infinitives following shall from the Penn Parsed Corpora of Historical English.
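The core idea of clustering only neighboring data points can be illustrated with a minimal Python sketch. The abstract does not state the exact merging criterion, so the version below assumes one plausible reading: at each step, merge the pair of adjacent clusters whose pooled values have the lowest population variance. The function names `vnc_merges` and `vnc_stages` and the sample values are hypothetical and are not taken from the paper.

```python
from statistics import pvariance


def vnc_merges(values):
    """Agglomerate a chronologically ordered series, merging only neighbors.

    Assumption (not from the source): the merge cost of two adjacent
    clusters is the population variance of their pooled values.
    Returns a list of (left_span, right_span, cost) tuples in merge order,
    where each span is an inclusive (start, end) index pair.
    """
    clusters = [[v] for v in values]
    spans = [(i, i) for i in range(len(values))]
    merges = []
    while len(clusters) > 1:
        # Only adjacent pairs are candidates, so temporal order is preserved.
        costs = [pvariance(clusters[i] + clusters[i + 1])
                 for i in range(len(clusters) - 1)]
        best = min(range(len(costs)), key=costs.__getitem__)
        merges.append((spans[best], spans[best + 1], costs[best]))
        # Replace the two neighbors with their union.
        clusters[best:best + 2] = [clusters[best] + clusters[best + 1]]
        spans[best:best + 2] = [(spans[best][0], spans[best + 1][1])]
    return merges


def vnc_stages(values, k):
    """Return the k contiguous index spans left after len(values) - k merges."""
    spans = [(i, i) for i in range(len(values))]
    for left, right, _cost in vnc_merges(values)[:len(values) - k]:
        i = spans.index(left)
        spans[i:i + 2] = [(left[0], right[1])]
    return spans


if __name__ == "__main__":
    # Hypothetical MLU-like measurements, one per recording session.
    mlu = [1.0, 1.2, 1.5, 4.0, 4.1, 9.0]
    for left, right, cost in vnc_merges(mlu):
        print(left, right, round(cost, 4))
    print(vnc_stages(mlu, 3))
```

The merge order and costs returned by `vnc_merges` contain the information a dendrogram would display, and `vnc_stages` cuts that hierarchy into a chosen number of contiguous stages. How many stages to retain is a separate decision that this sketch leaves to the analyst.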